The goal of this file is to document what has been done so far during my internship in computational methods (Stage en méthodes computationnelles) under the supervision of Louis Renaud-Desjardins. It mainly explains the methodological choices behind our work and presents the main results we have obtained so far.
Special thanks to François Claveau, Pierre-Olivier Méthot, Louis Renaud-Desjardins, Thomas Pradeu and Maël Lemoine for their support on this project.
For any questions or issues, please write to jacob.hamel-mottiez.1@ulaval.ca.
/****** Fetching relevant corpora
Project: Biology in philosophy
Created by Jacob Hamel-Mottiez and Louis Renaud-Desjardins on 2024-09-23.
// SETUP //
We run many tests on SQL Server. The cleanest way to export the data is to execute
the query, right-click on the result, and save it as a .csv file.
You can then save the file to OneDrive and fetch the data from anywhere.
// DATA STRUCTURE //
We are working with the Web of Science (Clarivate) database via Albator;
access is provided by Vincent Larivière.
All the information linking cited and citing documents goes through a key called "OST_BK", a unique identifier.
There are two ways in which you can deal with the data:
1) Take the .dbo tables, which come from WoS;
2) Take the .pex tables, which come from OST.
Your choice should be guided by the following methodological considerations.
The table that links the cited documents with the citing documents is constructed as follows:
1) The cited document must be present in the WOS (Web of Science) corpus.
2) The cited document must be an article, a note, or a review (with a `code_document` less than 4).
3) The documents citing or cited must have been published between 1900 and the present day.
4) The documents citing or cited must have at least one author listed in the `summary_name` table.
In short, if you want all the documents cited by a given citing journal, take .dbo. If you only want the scientific articles, take .pex, because it is cleaner.
// COMPLETENESS OF THE DATA //
When looking at B&P (Code_Revue = 2229), we were able to fetch a total of 67 774 cited documents.
Keeping only those with an OST_BK, the count drops to 43 828.
Requiring both an OST_BK and a Cited_Title, it drops to 36 335 (31 439 lost in total).
******/
-- Alternative table (to verify with Louis)
-- The idea was to select only the relevant columns from art. and to add EType_Document.
SELECT art.OST_BK
, art.Annee_Bibliographique as citing_year
, art.UID
, art.Titre
, art.Nb_Reference
, doc.EType_Document as Type_Document
, rev.Revue
, rev.Abbrev_11 as Abrev
, abst.Abstract as Abstract
FROM [WoS].[pex].[Article] as art
LEFT JOIN [WoS].[pex].[Liste_revue] as rev
on art.Code_Revue = rev.Code_Revue
LEFT JOIN [WoS].[dbo].[Abstract] as abst
on art.OST_BK = abst.OST_BK
LEFT JOIN [WoS].[pex].[Liste_Document] as doc
on art.Code_Document = doc.Code_Document
WHERE art.Code_Revue = 2229
-- Citing documents (articles published in B&P)
SELECT art.*
,rev.Revue
,rev.Abbrev_11
,abst.Abstract
FROM [WoS].[pex].[Article] as art
LEFT JOIN [WoS].[pex].[Liste_revue] as rev
on art.Code_Revue = rev.Code_Revue
LEFT JOIN [WoS].[dbo].[Abstract] as abst
on art.OST_BK = abst.OST_BK
WHERE art.Code_Revue = 2229
-- Citing authors (the authors of the citing documents)
SELECT art.OST_BK
, aut.First_Name
, aut.Last_Name
, aut.Seq_No
FROM [WoS].[pex].[Article] as art
LEFT JOIN [WoS].[dbo].[Summary_Name] as aut
on art.OST_BK = aut.OST_BK
WHERE art.Code_Revue = 2229
-- Cited documents (references of citing documents)
SELECT ref.[OST_BK]
,ref.[UID]
,id.[OST_BK] as OST_BK_Ref
,ref.[UID_Ref]
,ref.[Cited_Author]
,ref.[Year]
,ref.[Cited_Title]
,ref.[Cited_Work]
FROM [WoS].[dbo].[Reference] as ref
LEFT JOIN [WoS].[pex].[Article] as art
ON ref.OST_BK = art.OST_BK
LEFT JOIN [WoS].[dbo].[Dictionnaire_ID] as id
ON ref.UID_Ref = id.UID
WHERE art.Code_Revue = 2229
-- AND ref.[Cited_Work] IS NOT NULL
-- #67 752
-- AND id.[OST_BK] IS NOT NULL
-- # 43 828 out of 67 774.
-- AND ref.[Cited_Title] IS NOT NULL
-- # 36 335 out of 67 774.
/* MEANING OF THE COMMENTED CONDITIONS
These conditions let us count how many references
have no OST_BK (and thus no information about the cited document)
and how many cited documents have no title.
*/
-- Cited abstract (references' abstract)
SELECT DISTINCT id.[OST_BK] as OST_BK_Ref
,ref.[UID_Ref]
,abst.Abstract
FROM [WoS].[dbo].[Reference] as ref
LEFT JOIN [WoS].[pex].[Article] as art
ON ref.OST_BK = art.OST_BK
LEFT JOIN [WoS].[dbo].[Dictionnaire_ID] as id
ON ref.UID_Ref = id.UID
LEFT JOIN [WoS].[dbo].[Abstract] as abst
on id.OST_BK = abst.OST_BK
WHERE art.Code_Revue = 2229
AND abst.Abstract IS NOT NULL
-- alt. cited abstract (references' abstract) (with keywords and Keywords plus)
SELECT DISTINCT id.[OST_BK] as OST_BK_Ref
,ref.[UID_Ref]
,abst.[Abstract]
,keyw.[Keyword]
,keywP.[Keyword] as KeywordP
FROM [WoS].[dbo].[Reference] as ref
LEFT JOIN [WoS].[pex].[Article] as art
ON ref.OST_BK = art.OST_BK
LEFT JOIN [WoS].[dbo].[Dictionnaire_ID] as id
ON ref.UID_Ref = id.UID
LEFT JOIN [WoS].[dbo].[Abstract] as abst
on id.OST_BK = abst.OST_BK
LEFT JOIN [WoS].[dbo].[Keyword] as keyw
on id.OST_BK = keyw.OST_BK
LEFT JOIN [WoS].[dbo].[Keyword_Plus] as keywP
on id.OST_BK = keywP.OST_BK
WHERE art.Code_Revue = 2229
AND abst.Abstract IS NOT NULL
-- Cited authors (references' authors)
SELECT DISTINCT id.[OST_BK] as OST_BK_Ref
,ref.[UID_Ref]
,aut.First_Name
,aut.Last_Name
,aut.Seq_No
FROM [WoS].[dbo].[Reference] as ref
LEFT JOIN [WoS].[pex].[Article] as art
ON ref.OST_BK = art.OST_BK
LEFT JOIN [WoS].[dbo].[Dictionnaire_ID] as id
ON ref.UID_Ref = id.UID
LEFT JOIN [WoS].[dbo].[Summary_Name] as aut
on id.OST_BK = aut.OST_BK
WHERE art.Code_Revue = 2229
AND aut.Seq_No IS NOT NULL
-- The addresses of citing documents
SELECT contrib.*
FROM [WoS].[dbo].Address as contrib
LEFT JOIN [WoS].[pex].[Article] as art
ON contrib.OST_BK = art.OST_BK
WHERE art.Code_Revue = 2229
-- The organizations of citing documents
SELECT contrib.*
FROM [WoS].[dbo].Address_Organization as contrib
LEFT JOIN [WoS].[pex].[Article] as art
ON contrib.OST_BK = art.OST_BK
WHERE art.Code_Revue = 2229
## Some functions to display nice data tables and to add percentages automatically
fct_percent <- function(x) {
  x |>
    mutate(percent = n / sum(n, na.rm = TRUE) * 100) |>
    mutate(percent = round(percent, 3))
}
fct_DT <- function(x) {
  DT::datatable(head(x, 1000),
                options = list(scrollX = TRUE,
                               paging = TRUE,
                               pageLength = 5))
}
Early in our work came a methodological choice: choosing between two well-known databases, Web of Science (WoS) and Scopus. We chose Scopus, for two main reasons.
1. Scopus has wider coverage of the journals we want to investigate, that is, the main journals of philosophy of biology. For example, it covers Biological Theory, which is absent from our Web of Science database.
2. Scopus's reference metadata is far more complete: in our WoS data, roughly a third of the references lacked the unique identifier linking them to the cited document, which is crucial for co-citation analysis.
If you want in-depth details about both databases and their strengths and weaknesses given our corpus, read on. If you only want the results, you can skip the next section.
Here is how we fetch the information from the different databases.
For Springer, we looked manually at each volume and created an Excel sheet with the number of articles per year.
For Web of Science, we fetched the data through the Albator database, which we accessed via the OST; the detailed information about the SQL queries was provided earlier.
For Scopus, we used the Scopus API through the rscopus package. See the code below for the specific workflow.
bio_philo_query <- rscopus::scopus_search("EXACTSRCTITLE(Biology and Philosophy)",
view = "COMPLETE",
headers = insttoken)
bio_philo_raw <- gen_entries_to_df(bio_philo_query$entries)
bio_philo_papers <- read_csv(paste0(dir_od, "bio_philo_papers.csv"))
bio_philo_affiliations <- read_csv(paste0(dir_od, "bio_philo_affiliations.csv"))
bio_philo_authors <- read_csv(paste0(dir_od, "bio_philo_authors.csv"))
citing_articles <- bio_philo_papers$`dc:identifier` # extracting the IDs of our articles
Let’s start by making sure that we have good coverage for articles.
Let’s look at the total number of articles from each database for Biology & Philosophy. Unsurprisingly, the articles listed on Springer are more numerous than in the other databases. At first sight, Web of Science seems to have better coverage, with more than 200 articles over Scopus.
#Table
fct_DT(
art_all |>
group_by(FROM) |>
summarise(total_N = sum(N)) |>
arrange(desc(total_N))
)
Let’s look at the distribution of those articles since the beginning of Biology & Philosophy. We see that Web of Science has good coverage except for the journal’s earliest and most recent years, whereas Scopus does pretty well in that same range. Scopus, however, seems to lose a lot of articles from the 1995-2005 decade. Yet when we filter on articles only, Scopus actually achieves better coverage overall.
Now that we have a better idea of the articles we can get from both databases, let’s look at their references.
A big problem, however, is the references of those articles. For our subsequent work these references will be crucial, especially for co-citation networks. We uncovered that a great many references had no unique identifier (from about a third up to potentially half).
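Co-citation analysis links two references whenever the same article cites them together, which is why every reference needs a usable identifier. A minimal sketch of how such an edge list is built (toy data and illustrative column names, dplyr only):

```r
library(dplyr)

# Toy citing -> cited pairs; NA marks a reference with no identifier (OST_BK).
refs <- tibble(
  citing_id = c(1, 1, 1, 2, 2, 3, 3),
  cited_id  = c("A", "B", NA, "A", "B", "B", "C")
)

refs <- refs |> filter(!is.na(cited_id))  # unidentified references cannot pair

# Every pair of references cited together by the same article, with a weight.
cocitation <- refs |>
  inner_join(refs, by = "citing_id") |>
  filter(cited_id.x < cited_id.y) |>
  count(cited_id.x, cited_id.y, name = "weight")

cocitation  # A-B weight 2 (cited together by articles 1 and 2), B-C weight 1
```

References dropped for lack of an identifier simply vanish from the network, which is why losing a third to half of them matters so much.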
This is especially problematic. Here is some quantitative data:
# Completeness of data -----------------------------------------------------
df <- tibble(ref_bp) |>
  mutate(across(where(is.character), ~ na_if(., "NULL")))
df <- df |>
  rename(Citing_ID = OST_BK, Cited_ID = OST_BK_Ref, Cited_Year = Year)
# Create a tibble summarizing total rows, NA values, and non-NA values by column
summary_tibble <- tibble(
  column = names(df),
  total_rows = nrow(df),
  na_count = sapply(df, function(x) sum(is.na(x))),
  non_na_count = sapply(df, function(x) sum(!is.na(x)))
)
summary_tibble <- summary_tibble |>
  mutate(percent = round(non_na_count / total_rows * 100, 3)) |>
  arrange(desc(percent))
summary_tibble_WoS <- summary_tibble |>
filter(column != "UID" & column != "UID_Ref")
summary_tibble_WoS$column <- factor(summary_tibble_WoS$column, levels = c("Citing_ID", "Cited_ID","Cited_Author", "Cited_Year", "Cited_Work", "Cited_Title"))
summary_tibble_WoS <-summary_tibble_WoS |>
mutate(column_adjust = column) |> # this will simplify our work next
mutate(FROM = "Web of Science")
Given the important limitations listed above, we turned to another database that looked promising: Scopus.
We must thank Aurélien Goutsmedt, who made us aware of the Scopus API; it eased our work substantially (for more information, see his blog here).
First, let’s look at Biology & Philosophy in this new database to compare it to our results with Web of Science.
Here, compared to our WoS data, we get more than 68 800 references with a close to 100% match between references and articles. As a reminder, we had around 64 000 references when going through WoS, and almost a third of them had no link to their respective article.
Another reason to choose Scopus is its journal coverage. We looked at a journal that is not in WoS but is in Scopus: Biological Theory (BT).
Let’s compare the articles we can fetch with the Scopus API to the articles listed on Springer for this journal. We manually built a .csv counting all the articles listed on Springer for BT and compared it with what we got from the API.
The first step is to compare coverage between Springer and Scopus. As we see, both are pretty close, with Springer listing a little under 60 more articles than Scopus.
#Table
fct_DT(
art_all_BT |>
group_by(FROM) |>
summarise(total_N = sum(N)) |>
arrange(desc(total_N))
)
When we look at the coverage for each specific year, we see that it is pretty good. However, 2024 is a strange year in which Scopus’s coverage is better than Springer’s; we don’t understand why at the moment.
As we see, the Scopus coverage resembles the one we get from Springer when it comes to articles.
Now, let’s look at the metadata of the references. For the journal Biological Theory, we get 35 793 references.
We have almost 100% non-NA entries for a) Cited_Authors, b) Citing_ID, c) Cited_ID, and d) Cited_Work, which is either the article title or the book title. We should not be too bothered by the fact that the Cited_Title column has around 40% NA entries, since books typically do not have one.
More bothersome at first sight is the Cited_Year column, with about 40% NA values. Looking into it, we can easily understand why there are so many NAs: Scopus has done some cleaning beforehand, notably for books that have many editions. You can verify this by fetching the data directly from the Scopus website and comparing it with what we get through the API.
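One quick sanity check is to split the NA share of the year column by document type. A minimal sketch on toy data (our Scopus extract carries a `type` column; the names and values here are illustrative):

```r
library(dplyr)

# Toy references: books more often lack a year than journal articles do.
refs <- tibble(
  type = c("Book", "Book", "Book", "Article", "Article", "Article"),
  year = c(NA, NA, 1976, 1999, 2004, NA)
)

na_by_type <- refs |>
  group_by(type) |>
  summarise(n = n(), na_share = mean(is.na(year)) * 100)

na_by_type
```

If the NA share is much higher for book-like entries than for articles, that supports the multiple-editions explanation.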
# REGEX FOR VARIOUS REFERENCES' EXTRACTION --------------------------------
# Define extraction patterns
extract_authors <- paste0(
"^",
"(?:[A-Z]+(?:[-'][A-Z]+)*\\s+)*", # Matches first part of author name allowing hyphens and apostrophes
"(?:[A-Z]+(?:[-'][A-Z]+)*)", # Matches last part of author name allowing hyphens and apostrophes
"\\s+[A-Z](?:\\.[A-Z])*\\.", # Matches initials (e.g., J. or J.A.)
"(?:,\\s+",
"(?:[A-Z]+(?:[-'][A-Z]+)*\\s+)*", # Matches first part of additional author names
"(?:[A-Z]+(?:[-'][A-Z]+)*)", # Matches last part of additional author names
"\\s+[A-Z](?:\\.[A-Z])*\\.", # Matches initials of additional authors
")*"
)
extract_year <- "\\b(\\d{4})\\b"
extract_journal <- "[A-Z][A-Za-z\\s]+(?=\\,\\s\\d)"
extract_volume <- "\\b\\d+\\b(?=\\,|\\s)"
extract_issue <- "(?<=\\,\\s)(?:[A-Z])?\\d+(?=\\,|\\s|\\()|(?<=\\,\\s)[A-Z]\\d+(?=\\,|\\s|\\()"
extract_pages <- "\\bP{0,1}\\.\\s*\\d+(-\\d+)?\\b"
references_extract$references <- toupper(references_extract$references)
extraction <- function(ref) {
# Extract components
year <- str_extract(ref, extract_year)
authors <- str_extract(ref, extract_authors)
journal <- str_extract(ref, extract_journal)
pages <- str_extract(ref, extract_pages)
# Extract volume and issue separately
volume_issue <- str_extract(ref, "\\b\\d{1,4}\\b(,\\s*\\d{1,4})?")
# Split into volume and issue if both are present
if (!is.na(volume_issue)) {
volume_issue_split <- str_split(volume_issue, ",\\s*")[[1]]
volume <- volume_issue_split[1] # First part is the volume
issue <- ifelse(length(volume_issue_split) > 1, volume_issue_split[2], NA) # Second part is the issue, if it exists
} else {
volume <- NA
issue <- NA
}
# Clean up formats
year <- str_trim(year)
pages <- ifelse(!is.na(pages), str_extract(pages, "\\d+(-\\d+)?"), NA)
# Create a vector of extracted components, dropping NAs (an NA would
# inject the literal string "NA" into the removal pattern below)
extracted_parts <- c(authors, year, journal, volume, issue, pages)
extracted_parts <- extracted_parts[!is.na(extracted_parts)]
# Remove extracted parts and clean the remaining reference
remaining_ref <- ref %>%
str_remove_all(paste0(extracted_parts, collapse = "|")) %>%
str_remove_all(",\\s*") %>%
str_remove_all("\\s*\\(.*?\\)\\s*") %>%
str_remove_all("P\\.\\s*|PP\\.\\s*") %>%
str_remove_all("^\\s*|\\s*$") %>%
str_trim()
tibble(
extracted_year = year,
extracted_authors = authors,
unique_author = authors, # This extra column is to get all unique author for later count.
extracted_journal = journal,
extracted_volume = volume,
extracted_issue = issue,
extracted_pages = pages,
remaining_ref = remaining_ref
)
}
# Apply the function and handle nested results
results <- references_extract %>%
mutate(
extraction_results = map(references, extraction) # Apply function to each reference
) %>%
unnest_wider(extraction_results) # Unnest the tibble returned by `extraction`
#write_csv(results2, paste0(dir, "results2.csv"))
# View the results
fct_DT(results)
results_split <- results %>%
separate_rows(unique_author, sep = ",\\s*") # Split authors into multiple rows
fct_DT(results_split)
# SAVE RESULTS ------------------------------------------------------------
write_csv(results, paste0(dir_od, "cleaned_ref.csv"))
write_csv(results_split, paste0(dir_od, "cleaned_ref_split.csv"))
count_ref_art <- results |>
filter(!is.na(extracted_authors)) |>
select(extracted_authors, extracted_year, remaining_ref) |>
add_count(remaining_ref, extracted_authors) |>
unique() |>
arrange(desc(n))
fct_DT(count_ref_art)
Here, we see that Richard Dawkins’s famous book The Selfish Gene has been referred to with different publication years (i.e. 1976 and 1989). The same goes for Odling-Smee et al., which gets two different extracted_year values. If we look at the data from the Scopus API, we see that Dawkins’s book got no year attributed to it, and that many similar but distinct entries sit under the same unique identifier (scopus_id).
As we see, the completeness of the metadata is comparable to what we got for B&P, which is more than sufficient.
Now that we have checked that the coverage of both the articles and their references is satisfying, we can dig into some of the results. Let’s start with the articles.
clean_references_bp <- clean_references_bp |> mutate(year = as.Date(year)) |> mutate(year = year(year)) #To get only the year.
most_c_ref_bp <- clean_references_bp |>
  filter(!is.na(sourcetitle)) |>
  select(scopus_id, author, year, sourcetitle, title, type) |>
  add_count(sourcetitle, title, author, scopus_id) |>
  arrange(desc(n)) |>
  distinct()
most_c_ref_bp <- fct_percent(most_c_ref_bp)
fct_DT(most_c_ref_bp)
clean_references_th <- clean_references_th |> mutate(year = as.Date(year)) |> mutate(year = year(year))
most_c_ref <- clean_references_th |>
  filter(!is.na(sourcetitle)) |>
  select(scopus_id, author, year, sourcetitle, title, type) |>
  add_count(sourcetitle, title, author, scopus_id) |>
  arrange(desc(n)) |>
  distinct()
most_c_ref_th <- fct_percent(most_c_ref)
fct_DT(most_c_ref_th)
# BIOLOGY & PHILOSOPHY
clean_references_bp <- clean_references_bp |> add_count(scopus_id, author, sourcetitle, title, author_list_author_ce_initials, year, name = "most_n")
clean_references_bp <- clean_references_bp |>
distinct() |>
filter(!is.na(scopus_id))
clean_references_bp <- clean_references_bp |>
group_by(scopus_id) |>
filter(most_n == max(most_n)) |>
slice_head(n = 1) |>
ungroup()
rank_bp <- clean_references_bp |> mutate(rank_in_bp = dense_rank(-n)) |> arrange(desc(n))
# BIOLOGICAL THEORY
clean_references_th <- clean_references_th |> add_count(scopus_id, author, sourcetitle, title, author_list_author_ce_initials, year, name = "most_n")
clean_references_th <- clean_references_th |>
distinct() |>
filter(!is.na(scopus_id))
clean_references_th <- clean_references_th |>
group_by(scopus_id) |>
filter(most_n == max(most_n)) |>
slice_head(n = 1) |>
ungroup()
rank_th <- clean_references_th |> mutate(rank_in_th = dense_rank(-n)) |> arrange(desc(n))
cited_authors_tbl <- full_join(
rank_th |> select(scopus_id, author, year, sourcetitle, title, rank_in_th),
rank_bp |> select(scopus_id, rank_in_bp),
by = "scopus_id")
fct_DT(cited_authors_tbl)
cited_journals_bp <- clean_references_bp |> select(scopus_id, sourcetitle, title) |> filter(!is.na(title)) |> count(sourcetitle) |> arrange(desc(n))
cited_journals_bp <- fct_percent(cited_journals_bp)
fct_DT(cited_journals_bp)
cited_journals_th <- clean_references_th |> select(scopus_id, sourcetitle, title) |> filter(!is.na(title)) |> count(sourcetitle) |> arrange(desc(n))
cited_journals_th <- fct_percent(cited_journals_th)
fct_DT(cited_journals_th)
journal_rank_bp <- cited_journals_bp |> mutate(rank_in_bp = dense_rank(-n)) |> arrange(desc(n))
journal_rank_th <- cited_journals_th |> mutate(rank_in_th = dense_rank(-n)) |> arrange(desc(n))
journal_rank_all <- left_join(journal_rank_bp,journal_rank_th, by = "sourcetitle")
keyword_bp <- bio_philo_papers |> select(citing_art, dc_creator, year, authkeywords, prism_publication_name)
keyword_bp_cleaned <- keyword_bp |>
separate_rows(authkeywords, sep = " \\| ") |>
filter(!is.na(authkeywords))
keyword_bp_cleaned$authkeywords <- toupper(keyword_bp_cleaned$authkeywords)
# AUTHOR KEYWORD COUNTS
keyword_bp_count <- keyword_bp_cleaned |>
select(authkeywords) |> count(authkeywords, sort=TRUE)
keyword_bp_count <- fct_percent(keyword_bp_count)
fct_DT(keyword_bp_count)
keyword_th <- bio_th_papers |> select(citing_art, dc_creator, year, authkeywords, prism_publication_name)
keyword_th_cleaned <- keyword_th |>
separate_rows(authkeywords, sep = " \\| ") |>
filter(!is.na(authkeywords))
keyword_th_cleaned$authkeywords <- toupper(keyword_th_cleaned$authkeywords)
# AUTHOR KEYWORD COUNTS
keyword_th_count <- keyword_th_cleaned |>
group_by(authkeywords) |> count(authkeywords, sort=TRUE)
keyword_th_count <- fct_percent(keyword_th_count)
fct_DT(keyword_th_count)
Now that we have these tables, it may be useful to visualize those keywords and their importance.
An important thing to note is that we do not have access to the keywords of the references provided by Scopus.
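A word cloud is one simple way to visualize keyword importance. A minimal sketch with the `wordcloud` package (this package is not used elsewhere in the project, so treat it as a suggestion), on toy counts shaped like `keyword_bp_count` (columns `authkeywords` and `n`):

```r
# Toy keyword counts shaped like keyword_bp_count (authkeywords + n).
kw <- data.frame(
  authkeywords = c("NATURAL SELECTION", "EVOLUTION", "FUNCTION", "SPECIES"),
  n = c(40, 35, 20, 12)
)

if (requireNamespace("wordcloud", quietly = TRUE)) {
  set.seed(42)  # word placement is random; fix the seed for reproducibility
  wordcloud::wordcloud(words = kw$authkeywords, freq = kw$n, min.freq = 1)
}
```

Word size scales with `freq`, so the most frequent keywords dominate the plot visually.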
Here, we compute what we call the citation delay: for each reference, the difference between the citing article’s publication year and the reference’s publication year. Here is the cumulative distribution function showing how this citation delay evolves as the journal gets older.
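On toy data, the per-reference computation reduces to a single subtraction (illustrative column names):

```r
library(dplyr)

# Toy citing/cited pairs: the delay is citing year minus cited year.
refs <- tibble(
  citing_year = c(2000, 2000, 2010, 2010),
  cited_year  = c(1990, 1998, 1976, 2009)
)

refs <- refs |> mutate(delay = citing_year - cited_year)
refs$delay
# 10 2 34 1: stat_ecdf() then draws the cumulative distribution of such delays.
```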
# BIOLOGY & PHILOSOPHY
bio_philo_papers <- bio_philo_papers |> mutate(date = as.Date(prism_cover_date)) |>
mutate(year = year(date))
delay_refs_bp <- clean_references_bp |>
rename(cited_year = year) |>
left_join(bio_philo_papers |> select(citing_art, year),
by = "citing_art") |>
arrange(desc(citing_art))
delay_refs_bp <- delay_refs_bp |> mutate(delay = year-cited_year) |> mutate(from = "B&P")
delay_refs_bp$decade <- cut(delay_refs_bp$year,
breaks = c(1986, 1995, 2004, 2013, 2022, 2025), # Include up to 2024
labels = c("1986-1994", "1995-2003", "2004-2012", "2013-2021", "2022-2024"),
right = FALSE) # Left-inclusive
p1 <- ggplot(delay_refs_bp |> filter(!is.na(decade)), aes(x = delay, color = decade, group = decade)) +
stat_ecdf(geom = "step", show.legend = FALSE) +
labs(title = "Citation Delay Biology & Philosophy (1987-2022)",
x = "Delay",
y = "CDF") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_x_continuous(limits = c(0, 50))
# BIOLOGICAL THEORY
delay_refs_th <- clean_references_th |>
rename(cited_year = year) |>
left_join(bio_th_papers |>
select(citing_art, year), by = "citing_art") |>
arrange(desc(citing_art))
delay_refs_th <- delay_refs_th |> mutate(delay = year-cited_year) |> mutate(from = "BT")
unique(delay_refs_th$year)
## [1] 2024 2023 NA 2022 2013 2014 2006 2018 2011 2019 2007 2008 2015 2009 2010
## [16] 2020 2017 2021 2016
delay_refs_th$decade <- cut(delay_refs_th$year,
breaks = c(2006, 2014, 2023),
labels = c("2006-2013", "2014-2021"),
right = FALSE) # left-inclusive
p2 <- ggplot(delay_refs_th |> filter(!is.na(decade)), aes(x = delay, color = decade, group = decade)) +
stat_ecdf(geom = "step", show.legend = FALSE) +
labs(title = "Citation Delay Biological Theory (2006-2022)",
x = "Delay",
y = "CDF") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_x_continuous(limits = c(0, 50))
# BOTH B&P AND BT
all <- rbind(delay_refs_bp, delay_refs_th) |> filter(!is.na(decade))
p3 <- ggplot(all, aes(x = delay, color = decade, group = decade)) +
stat_ecdf(geom = "step") +
labs(
x = "Delay",
y = "CDF") +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "top", legend.title = element_blank()) +
scale_x_continuous(limits = c(0, 50)) +
facet_grid(rows = ~ from)
ggplotly(p3) |> layout(legend = list(title = FALSE, orientation = "h", # show entries horizontally
xanchor = "center", # use center of legend as anchor
x = 0.5, y = 1.2))
ggplotly(p1)
While this shift is interesting, we need to be careful. It could simply be that, as the journal ages, authors keep citing old works, creating an artificial rightward shift that is not really problematic. Let’s look at the distribution of citation delays.
p4 <- all |> ggplot(aes(x = delay, group = decade, fill = from, color = from)) +
geom_density(alpha = 0.5) +
facet_grid(rows = vars(from), cols = vars(decade))
ggplotly(p4) |> layout(legend = list(title = FALSE, orientation = "h", # show entries horizontally
xanchor = "center", # use center of legend as anchor
x = 0.5, y = 1.2))